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1  Star-P 

Star-P  is  an  interactive  parallel  scientific  comput- 
ing  environment.  It  aims  to  make  parallel  program¬ 
ming  more  accessible.  Star-P  borrows  ideas  from 
Matlab*P  [3],  but  is  a  new  development.  Currently 
only  a  Matlab  interface  for  Star-P  is  available, 
but  it  is  not  limited  to  being  a  parallel  Matlab.  It 
combines  all  four  parallel  Matlab  approaches  in  one 
environment,  as  described  in  the  parallel  Matlab 
survey  [2]:  embarrassingly  parallel,  message  pass¬ 
ing,  backend  support  and  compilation.  It  also  in¬ 
tegrates  several  parallel  numerical  libraries  into  one 
single  easy-to-use  piece  of  software. 

The  focus  of  Star-P  is  to  improve  user  productiv¬ 
ity  in  parallel  programming.  We  believe  that  Star- 
P  can  dramatically  reduce  the  difficulty  of  program¬ 
ming  parallel  computers  by  reducing  the  time  needed 
for  development  and  debugging. 

To  achieve  productivity,  it  is  important  that  the 
user  interface  is  intuitive  to  the  user.  For  example, 
a  large  class  of  scientific  users  are  already  familiar 
with  the  Matlab  language.  So  it  is  beneficial  to  use 
Matlab  as  a  parallel  programming  language.  Ad¬ 
ditions  to  the  language  are  minimal  in  keeping  with 
the  philosophy  to  avoid  re-learning.  Also,  as  a  design 
goal,  our  system  does  not  distinguish  between  serial 
data  and  parallel  data. 

C  =  op(A,B) 
print (C) 

C  will  be  the  same  whether  A  and  B  are  distributed 
or  not.  This  will  allow  the  same  piece  of  code  to 
run  sequentially  (when  all  the  arguments  are  serial) 
or  in  parallel  (when  at  least  one  of  the  arguments  is 
distributed). 
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2  Functionality 

Where  possible,  Star-P  leverages  existing,  estab¬ 
lished  parallel  numerical  libraries  to  perform  numer¬ 
ical  computation.  This  idea  is  inherited  from  Mat¬ 
lab*?.  Several  libraries  already  exist  which  provide  a 
wide  range  of  linear  algebra  and  other  routines,  and 
it  would  be  inefficient  to  reproduce  them.  Instead, 
Star-P  integrates  them  seamlessly  for  the  user. 

3  RT-STAP  Benchmarks 

The  RT-STAP  (Real-Time  Space-Time  Adaptive 
Processing)  benchmark  [1]  is  a  benchmark  for  real¬ 
time  signal  processing  systems  developed  by  the 
MITRE  Corporation.  In  the  hard  version  of  the 
benchmark  which  we  run,  the  input  to  the  Matlab 
code  is  a  data  cube  of  22  (channels)  x  64  x  480  dou¬ 
bles.  The  code  performs  the  following  three  steps: 

1.  Convert  the  input  data  to  baseband. 

2.  Doppler  processing. 

3.  Weight  computation  and  application  to  find  the 
range-Doppler  matrix. 

Upon  execution,  we  noticed  that  step  1  was  the 
most  time  consuming  step.  This  is  surprising,  since 
the  weight  computation  would  be  expected  to  have 
the  highest  FLOP  count.  It  turns  out  that  this  is  due 
to  the  Matlab  coding  style  used  in  the  benchmark 
code.  Since  the  point  of  the  benchmark  is  to  measure 
the  running  time  of  a  typical  application,  we  chose  to 
proceed  without  modifying  the  code.  The  conversion 
step  in  the  original  Matlab  code  is  a  for  loop  as 
follows: 

f  or  channum= 1 : NCHAN 

xx  =  CPI1_INITIAL(: ,channum) ; 

CPU  (  :  ,  channum)  =  baseband_convert  (xx,  ... 

SOME. ARGUMENTS) ; 

end 
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Time  taken  by  conversion  step 


Performance  of  the  sparse  constructor 


Figure  1:  Scalability  of  RT_STAP 


Figure  2:  Scalability  of  sparse  (SGI  Altix  350) 


It  loops  over  the  input  channels  and  processes  them 
in  an  embarrassingly  parallel  fashion.  This  makes  it  a 
natural  candidate  for  Star- P’s  multi-MATLAB  mode. 
We  converted  the  loop  to  run  in  Star-P  by  changing 
the  loop  into  a  function  call  and  putting  it  in  an  mm- 
mode  call: 

P_CPI1_ INITIAL  =  matlab2pp(CPIl_INITIAL,2) ; 
CPU  =  mm (’ convert _loop; ,  SOME. ARGUMENTS) ; 
CPU  =  CPU  ( :  ,  : )  ; 

Note  that  the  calls  before  and  after  the  mm  call  are 
used  to  transfer  data  to  the  server  and  back.  The 
time  required  by  these  calls  is  also  included  in  our 
timings. 

Figure  1  compares  timing  results  for  sequential  Mat- 
lab  and  Star-P  on  2,  11  and  22  processors.  The 
solid  line  shows  the  timings  that  would  be  obtained 
if  the  code  scales  perfectly.  The  real  timings  follow 
the  solid  line  quite  closely  except  for  the  22  proces¬ 
sors  case.  Going  from  11  processors  to  22  processors 
provides  no  additional  benefits.  This  is  easy  to  ex¬ 
plain  in  terms  of  granularity.  As  the  input  data  cube 
only  has  22  channels,  with  22  processors,  each  pro¬ 
cessor  has  only  1  channel  of  work,  as  opposed  to  2 
channels  in  the  11  processors  case.  So  there  is  very 
little  to  gain  from  using  additional  processors,  and 
any  benefit  is  offset  by  the  additional  time  needed  for 
communication. 


algorithms  are  often  useful  in  signal  processing  and 
embedded  computing.  Sparse  matrix  operations  of¬ 
ten  display  poor  spatial  and  temporal  locality  result¬ 
ing  in  irregular  memory  access  patterns. 

A  very  basic  and  fundamental  sparse  matrix  op¬ 
eration  is  the  sparse  matrix  constructor  in  Matlab 
-  sparse.  It  constructs  a  distributed  sparse  matrix 
from  3  vectors  containing  the  row  and  column  num¬ 
bers  and  the  corresponding  non-zeros.  The  sparse 
constructor  has  fairly  general  applications  in  build¬ 
ing  and  updating  tables,  histograms,  and  sparse  data 
structures  in  general.  It  also  accumulates  and  adds 
duplicate  entries. 

Figure  2  shows  the  performance  of  sparse  on  two 
matrices:  one  very  large  but  sparse,  with  107  rows 
and  4  x  107  non-zero  entries;  the  other  smaller  and 
denser,  with  106  rows  and  32  x  106  non-zero  entries. 
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4  Sparse  matrix  capabilities 

Star-P  provides  basic  sparse  matrix  capabilities  [4] 
similar  to  those  found  in  Matlab.  Sparse  matrix 
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Dream  of  taking  academic  software 
commercial 


Birth  of  Interactive  Supercomputing 


Star-P 


Interactive  Parallel  Computing  Environment 

Parallel  Client/Server  Architecture 

Main  goal:  parallel  computing  easier  on  the 
human  user 

Academic  Front  End:  MATLAB 

Four  parallel  approaches  interacting: 

-  Embarrassingly  Parallel 

-  Message  Passing 

-  Backend  Support  (insert  *p) 

-  Compiling 

Integrates  several  packages  into  one  easy  to 
use  software 


Page  Rank  Matrix 


1176663 


•  Web  crawl  of  170,000  pages  from  mit.edu 

•  Matlab*P  spy  plot  of  the  matrix  of  the  graph 


Clock 


c=mm(‘clock’); 

std(c); 

Simple  example  shows  two  modes 
interacting 


Pieces  of  Pi 


»  quad( '  4 .  / (1+x. A2)  '  ,  0 ,  1)  ; 
ans  =  3.14159270703219 

»  a  =  (0 : 3*p)  /  4 
a  =  ddense  object:  l-by-4 

»  a(:) 
ans  = 

0 

0.25000000000000 

0.50000000000000 

0.75000000000000 

»  b  =  a+ . 25 ; 

»  c  =  mm( ' quad' , ' 4 . / (1+x. A2) ' ,  a,  b) ;  %  Should  be  "feval" 

c  =  ddense  object:  l-by-4 

»  sum  (c  ( : )  ) 

ans  =  3.14159265358979 


FFT2  in  four  lines 


»  A  =  randn(4096,  4096*p) 

A  =  ddense  object:  4096-by-4096 
»  tic; 

»  B  =  mm  ( ’  f  f  t 1  ,  A)  ; 

»  C  =  B .  '  ; 

»  D  =  mm(’fft'  ,  C)  ; 

»  F  =  D .  '  ; 

»  toe 

elapsed_time  =  73.50 

»a  =  A  ( :  ,  : )  ; 

»  tic;  g  =  fft2 (a) ;  toe 
elapsed  time  =  202.95 


. . .  we  have  FFTW  installed  as  well! 


Matlab  sparse  matrix  design  principles 

•  All  operations  should  give  the  same  results  for 
sparse  and  full  matrices  (almost  all) 

•  Sparse  matrices  are  never  created  automatically, 
but  once  created  they  propagate 

•  Performance  is  important  —  but  usability,  simplicity, 
completeness,  and  robustness  are  more  important 

•  Storage  for  a  sparse  matrix  should  be  O(nonzeros) 

•  Time  for  a  sparse  operation  should  be  O(flops) 

(as  nearly  as  possible) 


Matlab  sparse  matrix  design  principles 

•  All  operations  should  give  the  same  results  for 
sparse  and  full  matrices  (almost  all) 

•  Sparse  matrices  are  never  created  automatically, 
but  once  created  they  propagate 

•  Performance  is  important  —  but  usability,  simplicity, 
completeness,  and  robustness  are  more  important 

•  Storage  for  a  sparse  matrix  should  be  O(nonzeros) 

•  Time  for  a  sparse  operation  should  be  O(flops) 

(as  nearly  as  possible) 

Matlab*P  dsparse  matrices:  same  principles, 
but  some  different  tradeoffs 


Sparse  matrix  operations 


dsparse  layout,  same  semantics  as  ddense 

For  now,  only  row  distribution 

Matrix  operators:  +,  -,  max,  etc. 

Matrix  indexing  and  concatenation 
A  (1:3,  [4  5  2])  =  [  B(:,  7)  C  ] ; 

A  \  b  by  direct  methods 
Conjugate  gradients 


Sparse  data  structure 
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Full: 

2-dimensional  array  of 
real  or  complex  numbers 

(nrows*ncols)  memory 


Sparse: 

compressed  row  storage 

about  (1.5*nzs  +  .5*nrows) 
memory 


Distributed  sparse  data  structure 


Each  processor  stores: 

•  #  of  local  nonzeros 

•  range  of  local  rows 

•  nonzeros  in  CSR  form 


Sparse  matrix  times  dense  vector 
y  =  A  *  x 

The  first  call  to  matvec  caches  a 
communication  schedule  for  matrix  A. 
Later  calls  to  multiply  any  vector  by  A  use 
the  cached  schedule. 

Communication  and  computation  overlap. 

Can  use  a  tuned  sequential  matvec  kernel 
on  each  processor. 


Sparse  linear  systems 


x  =  A\b 

Matrix  division  uses  MPI-based  direct  solvers: 
SuperLU_dist:  nonsymmetric  static  pivoting 
MUMPS:  nonsymmetric  multifrontal 
PSPASES:  Cholesky 

ppsetoption (' SparseDirectSolver' , ' SUPERLU' ) 

Iterative  solvers  implemented  in  Matlab*P 
Some  preconditioners;  ongoing  work 


Application:  Fluid  dynamics 


Modeling  density-driven 
instabilities  in  miscible 
fluids  (Goyal,  Meiburg) 

Groundwater  modeling, 
oil  recovery,  etc. 

Mixed  finite  difference  & 
spectral  method 

Large  sparse  generalized 
eigenvalue  problem 


function  lambda  =  peigs  (A,  B, 
sigma,  iter,  tol) 

[m  n]  =  size  (A)  ; 

C  =  A  -  sigma  *  B; 
y  =  rand  (m,  1)  ; 

for  k  =  1  liter 

q  =  y  . /  norm  (y) ; 
v  =  B  *  q; 
y  =  C  \  v; 
theta  =  dot  (q,  y) ; 
res  =  norm  (y  -  theta*q) ; 
if  res  <=  tol 
break ; 
end; 
end; 

lambda  =  1  /  theta; 


Combinatorial  algorithms  in  Matlab*P 


Sparse  matrices  are  a  good  start  on  primitives 
for  combinatorial  scientific  computing. 

-  Random-access  indexing:  A  ( i ,  j ) 

-  Neighbor  sequencing:  find  (A (i , : ) ) 

-  Sparse  table  construction:  sparse  (I,  J,  V) 


What  else  do  we  need? 


Sorting  in  Matlab*P 


[V,  perm]  =  sort  (V) 

Common  primitive  for  many  sparse  matrix  and 
array  algorithms:  sparse(),  indexing,  transpose 

Matlab*P  uses  a  parallel  sample  sort 


Sample  sort 


•  (Perform  a  random  permutation) 

•  Select  p- 1  “splitters”  to  form  p  buckets 

•  Route  each  element  to  the  correct  bucket 

•  Sort  each  bucket  locally 

•  “Starch”  the  result  to  match  the  distribution 
of  the  input  vector 


Sample  sort  example 

Initial  data  (after  randomizing) 

3|6|8|1|5|4|7|2|9 

Choose  splitters  (2  and  6) 

1|2|3|6|5|4|8|7|9 

Sort  local  data 

1|2|3|4|5|6|7|8|9 

Starch 

Il2l3l4l5l6l7l8l9 


How  sparse( )  works 


A  =  sparse  (I,  J,  V) 

Input:  ddense  vectors  I,  J,  V  (optionally,  also 
dimensions  and  distribution  info) 

Sort  triples  (i,j,v)by  (i,j) 

Starch  the  vectors  for  desired  row  distribution 

Locally  convert  to  compressed  row  indices 
Sum  values  with  duplicate  indices 


Graph  /  mesh  partitioning 


•  Reduce  communication  in 
matvec  and  other  parallel 
computations 

•  Reordering  for  sparse  GE 

•  PARMETIS 

•  Parts  of  G/Teng  Matlab 
meshpart  toolbox 


Geometric  mesh  partitioning 


Algorithm  of  Miller,  Teng,  Thurston,  Vavasis 

Partitions  irregular  finite  element  meshes  into  equal-size 
pieces  with  few  connecting  edges 

Guaranteed  quality  partitions  for  well-shaped  meshes, 
often  very  good  results  in  practice 

Existing  implementation  in  sequential  Matlab 

Code  runs  in  Matlab*P  with  very  minor  changes 


Outline  of  algorithm 


1 .  Project  points  stereographically  from  Rd  to  Rd+1 

2.  Find  “centerpoinf  ’  (generalized  median) 

3.  Conformal  map:  Rotate  and  dilate 

4.  Find  great  circle 

5.  Unmap  and  project  down 

6.  Convert  circle  to  separator 


Geometric  mesh  partitioning 


Matching  and  depth-first  search  in  Matlab 

•  dmperm:  Dulmage-Mendelsohn  decomposition 

•  Square,  full  rank  A: 

-  [p,  q,  r]  =  dmperm(A); 

-  A(p,q)  is  block  upper  triangular  with  nonzero  diagonal 

-  also,  strongly  connected  components  of  a  directed  graph 

-  also,  connected  components  of  an  undirected  graph 

•  Arbitrary  A: 

-  [p,  q,  r,  s]  =  dmperm(A); 

-  maximum-size  matching  in  a  bipartite  graph 

-  minimum-size  vertex  cover  in  a  bipartite  graph 

-  decomposition  into  strong  Hall  blocks 


Connected  components 

Sequential  Matlab  uses  depth-first  search  (dmperm), 
which  doesn’t  parallelize  well 

Shiloach-Vishkin  algorithm: 

-  repeat 

•  Link  every  (super)vertex  to  a  random  neighbor 

•  Shrink  each  tree  to  a  supervertex  by  pointer  jumping 

-  until  no  further  change 

Originally  a  processor-efficient  PRAM  algorithm 
Matlab*P  code  looks  much  like  the  PRAM  code 


Pointer  jumping 


while  ~all ( 
C (myrows) 
end 

C  (myrows)  = 


C (myrows)  ==  C (C (myrows) )  ) 

=  C (C (myrows) ) ; 

min  (C (myrows) ,  C (C (myrows) ) ) 


Example  of  execution 


Page  Rank 

•  Importance  ranking 
of  web  pages 

•  Stationary  distribution 
of  a  Markov  chain 

•  Power  method:  matvec 
and  vector  arithmetic 

•  Matlab*P  page  ranking 
demo  (from  SC’03)  on 
a  web  crawl  of  mit.edu 
(170,000  pages) 


Remarks 

Easy-to-use  interactive  programming  environment 


Interface  to  existing  parallel  packages 

Combinatorial  methods  toolbox  being  built  on 
parallel  sparse  matrix  infrastructure 

-  Much  to  be  done:  spanning  trees,  searches,  etc. 

A  few  issues  for  ongoing  work 

-  Dynamic  resource  management 

-  Fault  management 

-  Programming  in  the  large 


